diff --git a/chat/core/src/main/python/few_shot_example/sql_exampler.py b/chat/core/src/main/python/few_shot_example/sql_exampler.py index 454144f85..aeeb4c79f 100644 --- a/chat/core/src/main/python/few_shot_example/sql_exampler.py +++ b/chat/core/src/main/python/few_shot_example/sql_exampler.py @@ -1,371 +1,360 @@ -examplars = [ - { - "current_date": "2020-12-01", - "table_name": "内容库产品", - "fields_list": """["部门", "模块", "用户名", "访问次数", "访问人数", "访问时长", "数据日期"]""", - "question": "比较jackjchen和robinlee在内容库的访问次数", - "prior_schema_links": """['jackjchen'->用户名, 'robinlee'->用户名]""", +examplars= [ + { "current_date":"2020-12-01", + "table_name":"内容库产品", + "fields_list":"""["部门", "模块", "用户名", "访问次数", "访问人数", "访问时长", "数据日期"]""", + "question":"比较jackjchen和robinlee在内容库的访问次数", + "prior_schema_links":"""['jackjchen'->用户名, 'robinlee'->用户名]""", "analysis": """让我们一步一步地思考。在问题“比较jackjchen和robinlee在内容库的访问次数“中,我们被问: -“比较jackjchen和robinlee”,所以我们需要column=[用户名] -”内容库的访问次数“,所以我们需要column=[访问次数] -基于table和columns,可能的cell values 是 = ['jackjchen', 'robinlee']。""", - "schema_links": """["用户名", "访问次数", "'jackjchen'", "'robinlee'"]""", - "sql": """select 用户名, 访问次数 from 内容库产品 where 用户名 in ('jackjchen', 'robinlee') and 数据日期 = '2020-12-01' """, - }, - { - "current_date": "2022-11-06", - "table_name": "内容库产品", - "fields_list": """["部门", "模块", "用户名", "访问次数", "访问人数", "访问时长", "数据日期"]""", - "question": "内容库近12个月访问人数 按部门", - "prior_schema_links": """[]""", +“比较jackjchen和robinlee”,所以我们需要column=[用户名],cell values = ['jackjchen', 'robinlee'],所以有[用户名:('jackjchen', 'robinlee')] +”内容库的访问次数“,所以我们需要column=[访问次数]""", + "schema_links":"""["用户名":("'jackjchen'", "'robinlee'"), "访问次数"]""", + "sql":"""select 用户名, 访问次数 from 内容库产品 where 用户名 in ('jackjchen', 'robinlee')""" + }, + { "current_date":"2022-11-06", + "table_name":"内容库产品", + "fields_list":"""["部门", "模块", "用户名", "访问次数", "访问人数", "访问时长", "数据日期"]""", + "question":"内容库近12个月访问人数 按部门", + "prior_schema_links":"""[]""", "analysis": """让我们一步一步地思考。在问题“内容库近12个月访问人数 
按部门“中,我们被问: -”内容库近12个月“,所以我们需要column=[数据日期] +”内容库近12个月“,所以我们需要column=[数据日期],cell values = [12],所以有[数据日期:(12)] “访问人数”,所以我们需要column=[访问人数] -”按部门“,所以我们需要column=[部门] -基于table和columns,可能的cell values 是 = [12]。""", - "schema_links": """["访问人数", "部门", "数据日期", 12]""", - "sql": """select 部门, 数据日期, 访问人数 from 内容库产品 where datediff('month', 数据日期, '2022-11-06') <= 12 """, - }, - { - "current_date": "2023-04-21", - "table_name": "内容库产品", - "fields_list": """["部门", "模块", "用户名", "访问次数", "访问人数", "访问时长", "数据日期"]""", - "question": "内容库美术部、技术研发部的访问时长", - "prior_schema_links": """['美术部'->部门, '技术研发部'->部门]""", +”按部门“,所以我们需要column=[部门]""", + "schema_links":"""["数据日期":(12), "访问人数", "部门"]""", + "sql":"""select 部门, 数据日期, 访问人数 from 内容库产品 where datediff('month', 数据日期, '2022-11-06') <= 12 """ + }, + { "current_date":"2023-04-21", + "table_name":"内容库产品", + "fields_list":"""["部门", "模块", "用户名", "访问次数", "访问人数", "访问时长", "数据日期"]""", + "question":"内容库美术部、技术研发部的访问时长", + "prior_schema_links":"""['美术部'->部门, '技术研发部'->部门]""", "analysis": """让我们一步一步地思考。在问题“内容库美术部、技术研发部的访问时长“中,我们被问: “访问时长”,所以我们需要column=[访问时长] -”内容库美术部、技术研发部“,所以我们需要column=[部门] -基于table和columns,可能的cell values 是 = ['美术部', '技术研发部']。""", - "schema_links": """["访问时长", "部门", "'美术部'", "'技术研发部'"]""", - "sql": """select 部门, 访问时长 from 内容库产品 where 部门 in ('美术部', '技术研发部') and 数据日期 = '2023-04-21' """, - }, - { - "current_date": "2023-08-21", - "table_name": "严选", - "fields_list": """["严选版权归属系", "付费模式", "结算播放份额", "付费用户结算播放份额", "数据日期"]""", - "question": "近3天海田飞系MPPM结算播放份额", - "prior_schema_links": """['海田飞系'->严选版权归属系]""", +”内容库美术部、技术研发部“,所以我们需要column=[部门], cell values = ['美术部', '技术研发部'],所以有[部门:('美术部', '技术研发部')]""", + "schema_links":"""["访问时长", "部门":("'美术部'", "'技术研发部'")]""", + "sql":"""select 部门, 访问时长 from 内容库产品 where 部门 in ('美术部', '技术研发部')""" + }, + { "current_date":"2023-08-21", + "table_name":"严选", + "fields_list":"""["严选版权归属系", "付费模式", "结算播放份额", "付费用户结算播放份额", "数据日期"]""", + "question":"近3天海田飞系MPPM结算播放份额", + "prior_schema_links":"""['海田飞系'->严选版权归属系]""", 
"analysis": """让我们一步一步地思考。在问题“近3天海田飞系MPPM结算播放份额“中,我们被问: -“MPPM结算播放份额”,所以我们需要column=[结算播放份额] -”海田飞系“,所以我们需要column=[严选版权归属系] -”近3天“,所以我们需要column=[数据日期] -基于table和columns,可能的cell values 是 = ['海田飞系', 3]。""", - "schema_links": """["结算播放份额", "严选版权归属系", "数据日期", "'海田飞系'", 3]""", - "sql": """select 严选版权归属系, 结算播放份额 from 严选 where 严选版权归属系 = '海田飞系' and datediff('day', 数据日期, '2023-08-21') <= 3 """, - }, - { - "current_date": "2023-05-22", - "table_name": "歌曲库", - "fields_list": """["是否潮流人歌曲", "C音歌曲ID", "C音歌曲MID", "歌曲名", "歌曲版本", "语种", "歌曲类型", "翻唱类型", "MPPM歌曲ID", "是否严选窄口径歌曲", "是否严选宽口径歌曲", "结算播放量", "运营播放量", "付费用户结算播放量", "历史累计结算播放量", "运营搜播量", "结算搜播量", "运营完播量", "运营推播量", "近7日复播率", "日均搜播量", "数据日期"]""", - "question": "对比近7天翻唱版和纯音乐的歌曲播放量", - "prior_schema_links": """['纯音乐'->语种, '翻唱版'->歌曲版本]""", +“MPPM结算播放份额”,所以我们需要column=[结算播放份额], +”海田飞系“,所以我们需要column=[严选版权归属系], cell values = ['海田飞系'],所以有[严选版权归属系:('海田飞系')], +”近3天“,所以我们需要column=[数据日期], cell values = [3],所以有[数据日期:(3)]""", + "schema_links":"""["结算播放份额", "严选版权归属系":("'海田飞系'"), "数据日期":(3)]""", + "sql":"""select 严选版权归属系, 结算播放份额 from 严选 where 严选版权归属系 = '海田飞系' and datediff('day', 数据日期, '2023-08-21') <= 3 """ + }, + { "current_date":"2023-05-22", + "table_name":"歌曲库", + "fields_list":"""["是否潮流人歌曲", "C音歌曲ID", "C音歌曲MID", "歌曲名", "歌曲版本", "语种", "歌曲类型", "翻唱类型", "MPPM歌曲ID", "是否严选窄口径歌曲", "是否严选宽口径歌曲", "结算播放量", "运营播放量", "付费用户结算播放量", "历史累计结算播放量", "运营搜播量", "结算搜播量", "运营完播量", "运营推播量", "近7日复播率", "日均搜播量", "数据日期"]""", + "question":"对比近7天翻唱版和纯音乐的歌曲播放量", + "prior_schema_links":"""['纯音乐'->语种, '翻唱版'->歌曲版本]""", "analysis": """让我们一步一步地思考。在问题“对比近3天翻唱版和纯音乐的歌曲播放量“中,我们被问: “歌曲播放量”,所以我们需要column=[结算播放量] -”翻唱版“,所以我们需要column=[歌曲版本] -”和纯音乐的歌曲“,所以我们需要column=[语种] -”近7天“,所以我们需要column=[数据日期] -基于table和columns,可能的cell values 是 = ['翻唱版', '纯音乐', 7]。""", - "schema_links": """["结算播放量", "歌曲版本", "语种", "数据日期", "'翻唱版'", "'纯音乐'", 7]""", - "sql": """select 歌曲版本, 语种, 结算播放量 from 歌曲库 where 歌曲版本 = '翻唱版' and 语种 = '纯音乐' and datediff('day', 数据日期, '2023-05-22') <= 7 """, - }, - { - "current_date": 
"2023-05-31", - "table_name": "艺人库", - "fields_list": """["上下架状态", "歌手名", "歌手等级", "歌手类型", "歌手来源", "MPPM潮流人等级", "活跃区域", "年龄", "歌手才能", "歌手风格", "粉丝数", "潮音粉丝数", "超声波粉丝数", "推博粉丝数", "超声波歌曲数", "在架歌曲数", "超声波分享数", "独占歌曲数", "超声波在架歌曲评论数", "有播放量歌曲数", "数据日期"]""", - "question": "对比一下陈拙悬、孟梅琦、赖媚韵的粉丝数", - "prior_schema_links": """['1527896'->MPPM歌手ID, '1565463'->MPPM歌手ID, '2141459'->MPPM歌手ID]""", +”翻唱版“,所以我们需要column=[歌曲版本], cell values = ['翻唱版'],所以有[歌曲版本:('翻唱版')] +”和纯音乐的歌曲“,所以我们需要column=[语种], cell values = ['纯音乐'],所以有[语种:('纯音乐')] +”近7天“,所以我们需要column=[数据日期], cell values = [7],所以有[数据日期:(7)]""", + "schema_links":"""["结算播放量", "歌曲版本":("'翻唱版'"), "语种":("'纯音乐'"), "数据日期":(7)]""", + "sql":"""select 歌曲版本, 语种, 结算播放量 from 歌曲库 where 歌曲版本 = '翻唱版' and 语种 = '纯音乐' and datediff('day', 数据日期, '2023-05-22') <= 7 """ + }, + { "current_date":"2023-05-31", + "table_name":"艺人库", + "fields_list":"""["上下架状态", "歌手名", "歌手等级", "歌手类型", "歌手来源", "MPPM潮流人等级", "活跃区域", "年龄", "歌手才能", "歌手风格", "粉丝数", "潮音粉丝数", "超声波粉丝数", "推博粉丝数", "超声波歌曲数", "在架歌曲数", "超声波分享数", "独占歌曲数", "超声波在架歌曲评论数", "有播放量歌曲数", "数据日期"]""", + "question":"对比一下陈拙悬、孟梅琦、赖媚韵的粉丝数", + "prior_schema_links":"""['1527896'->MPPM歌手ID, '1565463'->MPPM歌手ID, '2141459'->MPPM歌手ID]""", "analysis": """让我们一步一步地思考。在问题“对比一下陈拙悬、孟梅琦、赖媚韵的粉丝数“中,我们被问: “粉丝数”,所以我们需要column=[粉丝数] -”陈拙悬、孟梅琦、赖媚韵“,所以我们需要column=[歌手名] -基于table和columns,可能的cell values 是 = ['陈拙悬', '孟梅琦', '赖媚韵']。""", - "schema_links": """["粉丝数", "歌手名", "'陈拙悬'", "'孟梅琦'", "'赖媚韵'"]""", - "sql": """select 歌手名, 粉丝数 from 艺人库 where 歌手名 in ('陈拙悬', '孟梅琦', '赖媚韵') and 数据日期 = '2023-05-31' """, - }, - { - "current_date": "2023-07-31", - "table_name": "歌曲库", - "fields_list": """["歌曲名", "歌曲版本", "歌曲类型", "MPPM歌曲ID", "是否严选窄口径歌曲", "是否严选宽口径歌曲", "是否潮流人歌曲", "超声波歌曲ID", "C音歌曲ID", "C音歌曲MID", "结算播放量", "运营播放量", "分享量", "收藏量", "运营搜播量", "结算搜播量", "拉新用户数", "拉活用户数", "分享率", "结算播放份额", "数据日期"]""", - "question": "播放量大于1万的歌曲有多少", - "prior_schema_links": """[]""", +”陈拙悬、孟梅琦、赖媚韵“,所以我们需要column=[歌手名], cell values = ['陈拙悬', '孟梅琦', '赖媚韵'],所以有[歌手名:('陈拙悬', '孟梅琦', '赖媚韵')]""", + 
"schema_links":"""["粉丝数", "歌手名":("'陈拙悬'", "'孟梅琦'", "'赖媚韵'")]""", + "sql":"""select 歌手名, 粉丝数 from 艺人库 where 歌手名 in ('陈拙悬', '孟梅琦', '赖媚韵')""" + }, + { "current_date":"2023-07-31", + "table_name":"歌曲库", + "fields_list":"""["歌曲名", "歌曲版本", "歌曲类型", "MPPM歌曲ID", "是否严选窄口径歌曲", "是否严选宽口径歌曲", "是否潮流人歌曲", "超声波歌曲ID", "C音歌曲ID", "C音歌曲MID", "结算播放量", "运营播放量", "分享量", "收藏量", "运营搜播量", "结算搜播量", "拉新用户数", "拉活用户数", "分享率", "结算播放份额", "数据日期"]""", + "question":"播放量大于1万的歌曲有多少", + "prior_schema_links":"""[]""", "analysis": """让我们一步一步地思考。在问题“播放量大于1万的歌曲有多少“中,我们被问: “歌曲有多少”,所以我们需要column=[歌曲名] -”播放量大于1万的“,所以我们需要column=[结算播放量] -基于table和columns,可能的cell values 是 = [10000]。""", - "schema_links": """["歌曲名", "结算播放量", 10000]""", - "sql": """select 歌曲名 from 歌曲库 where 结算播放量 > 10000 and 数据日期 = '2023-07-31' """, - }, - { - "current_date": "2023-07-31", - "table_name": "内容库产品", - "fields_list": """["用户名", "部门", "模块", "访问时长", "访问次数", "访问人数", "数据日期"]""", - "question": "内容库访问时长小于1小时,且来自美术部的用户是哪些", - "prior_schema_links": """['美术部'->部门]""", +”播放量大于1万的“,所以我们需要column=[结算播放量], cell values = [10000],所以有[结算播放量:(10000)]""", + "schema_links":"""["歌曲名", "结算播放量":(10000)]""", + "sql":"""select 歌曲名 from 歌曲库 where 结算播放量 > 10000""" + }, + { "current_date":"2023-07-31", + "table_name":"内容库产品", + "fields_list":"""["用户名", "部门", "模块", "访问时长", "访问次数", "访问人数", "数据日期"]""", + "question":"内容库访问时长小于1小时,且来自美术部的用户是哪些", + "prior_schema_links":"""['美术部'->部门]""", "analysis": """让我们一步一步地思考。在问题“内容库访问时长小于1小时,且来自美术部的用户是哪些“中,我们被问: “用户是哪些”,所以我们需要column=[用户名] -”美术部的“,所以我们需要column=[部门] -”访问时长小于1小时“,所以我们需要column=[访问时长] -基于table和columns,可能的cell values 是 = ['美术部', 1]。""", - "schema_links": """["用户名", "部门", "访问时长", "'美术部'", 1]""", - "sql": """select 用户名 from 内容库产品 where 部门 = '美术部' and 访问时长 < 1 and 数据日期 = '2023-07-31' """, - }, - { - "current_date": "2023-08-31", - "table_name": "内容库产品", - "fields_list": """["用户名", "部门", "模块", "访问时长", "访问次数", "访问人数", "数据日期"]""", - "question": "内容库pv最高的用户有哪些", - "prior_schema_links": """[]""", +”美术部的“,所以我们需要column=[部门], cell 
values = ['美术部'],所以有[部门:('美术部')] +”访问时长小于1小时“,所以我们需要column=[访问时长], cell values = [1],所以有[访问时长:(1)]""", + "schema_links":"""["用户名", "部门":("'美术部'"), "访问时长":(1)]""", + "sql":"""select 用户名 from 内容库产品 where 部门 = '美术部' and 访问时长 < 1""" + }, + { "current_date":"2023-08-31", + "table_name":"内容库产品", + "fields_list":"""["用户名", "部门", "模块", "访问时长", "访问次数", "访问人数", "数据日期"]""", + "question":"内容库pv最高的用户有哪些", + "prior_schema_links":"""[]""", "analysis": """让我们一步一步地思考。在问题“内容库pv最高的用户有哪些“中,我们被问: “用户有哪些”,所以我们需要column=[用户名] -”pv最高的“,所以我们需要column=[访问次数] -基于table和columns,可能的cell values 是 = []。""", - "schema_links": """["用户名", "访问次数"]""", - "sql": """select 用户名 from 内容库产品 where 数据日期 = '2023-08-31' order by 访问次数 desc limit 10 """, - }, - { - "current_date": "2023-08-31", - "table_name": "艺人库", - "fields_list": """["播放量层级", "播放量单调性", "播放量方差", "播放量突增类型", "播放量集中度", "歌手名", "歌手等级", "歌手类型", "歌手来源", "MPPM潮流人等级", "结算播放量", "运营播放量", "历史累计结算播放量", "有播放量歌曲数", "历史累计运营播放量", "付费用户结算播放量", "结算播放量占比", "运营播放份额", "免费用户结算播放占比", "完播量", "数据日期"]""", - "question": "近90天袁亚伟播放量平均值是多少", - "prior_schema_links": """['152789226'->MPPM歌手ID]""", +”pv最高的“,所以我们需要column=[访问次数], cell values = [1],所以有[访问次数:(1)]""", + "schema_links":"""["用户名", "访问次数":(1)]""", + "sql":"""select 用户名 from 内容库产品 order by 访问次数 desc limit 1""" + }, + { "current_date":"2023-08-31", + "table_name":"艺人库", + "fields_list":"""["播放量层级", "播放量单调性", "播放量方差", "播放量突增类型", "播放量集中度", "歌手名", "歌手等级", "歌手类型", "歌手来源", "MPPM潮流人等级", "结算播放量", "运营播放量", "历史累计结算播放量", "有播放量歌曲数", "历史累计运营播放量", "付费用户结算播放量", "结算播放量占比", "运营播放份额", "免费用户结算播放占比", "完播量", "数据日期"]""", + "question":"近90天袁亚伟播放量平均值是多少", + "prior_schema_links":"""['152789226'->MPPM歌手ID]""", "analysis": """让我们一步一步地思考。在问题“近90天袁亚伟播放量平均值是多少“中,我们被问: “播放量平均值是多少”,所以我们需要column=[结算播放量] -”袁亚伟“,所以我们需要column=[歌手名] -”近90天“,所以我们需要column=[数据日期] -基于table和columns,可能的cell values 是 = ['袁亚伟', 90]。""", - "schema_links": """["结算播放量", "歌手名", "数据日期", "'袁亚伟'", 90]""", - "sql": """select avg(结算播放量) from 艺人库 where 歌手名 = '袁亚伟' and datediff('day', 数据日期, 
'2023-08-31') <= 90 """, - }, - { - "current_date": "2023-08-31", - "table_name": "艺人库", - "fields_list": """["播放量层级", "播放量单调性", "播放量方差", "播放量突增类型", "播放量集中度", "歌手名", "歌手等级", "歌手类型", "歌手来源", "MPPM潮流人等级", "结算播放量", "运营播放量", "历史累计结算播放量", "有播放量歌曲数", "历史累计运营播放量", "付费用户结算播放量", "结算播放量占比", "运营播放份额", "免费用户结算播放占比", "完播量", "数据日期"]""", - "question": "周倩倩近7天结算播放量总和是多少", - "prior_schema_links": """['199509'->MPPM歌手ID]""", +”袁亚伟“,所以我们需要column=[歌手名], cell values = ['袁亚伟'],所以有[歌手名:('袁亚伟')] +”近90天“,所以我们需要column=[数据日期], cell values = [90],所以有[数据日期:(90)]""", + "schema_links":"""["结算播放量", "歌手名":("'袁亚伟'"), "数据日期":(90)]""", + "sql":"""select avg(结算播放量) from 艺人库 where 歌手名 = '袁亚伟' and datediff('day', 数据日期, '2023-08-31') <= 90 """ + }, + { "current_date":"2023-08-31", + "table_name":"艺人库", + "fields_list":"""["播放量层级", "播放量单调性", "播放量方差", "播放量突增类型", "播放量集中度", "歌手名", "歌手等级", "歌手类型", "歌手来源", "MPPM潮流人等级", "结算播放量", "运营播放量", "历史累计结算播放量", "有播放量歌曲数", "历史累计运营播放量", "付费用户结算播放量", "结算播放量占比", "运营播放份额", "免费用户结算播放占比", "完播量", "数据日期"]""", + "question":"周倩倩近7天结算播放量总和是多少", + "prior_schema_links":"""['199509'->MPPM歌手ID]""", "analysis": """让我们一步一步地思考。在问题“周倩倩近7天结算播放量总和是多少“中,我们被问: “结算播放量总和是多少”,所以我们需要column=[结算播放量] -”周倩倩“,所以我们需要column=[歌手名] -”近7天“,所以我们需要column=[数据日期] -基于table和columns,可能的cell values 是 = ['周倩倩', 7]。""", - "schema_links": """["结算播放量", "歌手名", "数据日期", "'周倩倩'", 7]""", - "sql": """select sum(结算播放量) from 艺人库 where 歌手名 = '周倩倩' and datediff('day', 数据日期, '2023-08-31') <= 7 """, - }, - { - "current_date": "2023-09-14", - "table_name": "内容库产品", - "fields_list": """["部门", "模块", "用户名", "访问次数", "访问人数", "访问时长", "数据日期"]""", - "question": "内容库访问次数大于1k的部门是哪些", - "prior_schema_links": """[]""", +”周倩倩“,所以我们需要column=[歌手名], cell values = ['周倩倩'],所以有[歌手名:('周倩倩')] +”近7天“,所以我们需要column=[数据日期], cell values = [7],所以有[数据日期:(7)]""", + "schema_links":"""["结算播放量", "歌手名":("'周倩倩'"), "数据日期":(7)]""", + "sql":"""select sum(结算播放量) from 艺人库 where 歌手名 = '周倩倩' and datediff('day', 数据日期, '2023-08-31') <= 7 """ + }, + { 
"current_date":"2023-09-14", + "table_name":"内容库产品", + "fields_list":"""["部门", "模块", "用户名", "访问次数", "访问人数", "访问时长", "数据日期"]""", + "question":"内容库访问次数大于1k的部门是哪些", + "prior_schema_links":"""[]""", "analysis": """让我们一步一步地思考。在问题“内容库访问次数大于1k的部门是哪些“中,我们被问: “部门是哪些”,所以我们需要column=[部门] -”访问次数大于1k的“,所以我们需要column=[访问次数] -基于table和columns,可能的cell values 是 = [1000]。""", - "schema_links": """["部门", "访问次数", 1000]""", - "sql": """select 部门 from 内容库产品 where 访问次数 > 1000 and 数据日期 = '2023-09-14' """, - }, - { - "current_date": "2023-09-18", - "table_name": "歌曲库", - "fields_list": """["歌曲名", "MPPM歌手ID", "歌曲版本", "歌曲类型", "MPPM歌曲ID", "是否严选窄口径歌曲", "是否严选宽口径歌曲", "是否潮流人歌曲", "超声波歌曲ID", "C音歌曲ID", "C音歌曲MID", "结算播放量", "运营播放量", "分享量", "收藏量", "运营搜播量", "结算搜播量", "拉新用户数", "拉活用户数", "分享率", "结算播放份额", "数据日期"]""", - "question": "陈亿训唱的所有的播放量大于20k的孤勇者有哪些", - "prior_schema_links": """['199509'->MPPM歌手ID, '1527123'->MPPM歌曲ID]""", +”访问次数大于1k的“,所以我们需要column=[访问次数], cell values = [1000],所以有[访问次数:(1000)]""", + "schema_links":"""["部门", "访问次数":(1000)]""", + "sql":"""select 部门 from 内容库产品 where 访问次数 > 1000""" + }, + { "current_date":"2023-09-18", + "table_name":"歌曲库", + "fields_list":"""["歌曲名", "MPPM歌手ID", "歌曲版本", "歌曲类型", "MPPM歌曲ID", "是否严选窄口径歌曲", "是否严选宽口径歌曲", "是否潮流人歌曲", "超声波歌曲ID", "C音歌曲ID", "C音歌曲MID", "结算播放量", "运营播放量", "分享量", "收藏量", "运营搜播量", "结算搜播量", "拉新用户数", "拉活用户数", "分享率", "结算播放份额", "数据日期"]""", + "question":"陈亿训唱的所有的播放量大于20k的孤勇者有哪些", + "prior_schema_links":"""['199509'->MPPM歌手ID, '1527123'->MPPM歌曲ID]""", "analysis": """让我们一步一步地思考。在问题“陈亿训唱的所有的播放量大于20k的孤勇者有哪些“中,我们被问: -“孤勇者有哪些”,所以我们需要column=[歌曲名] -”播放量大于20k的“,所以我们需要column=[结算播放量] -”陈亿训唱的“,所以我们需要column=[歌手名] -基于table和columns,可能的cell values 是 = [20000, '陈亿训', '孤勇者']。""", - "schema_links": """["歌曲名", "结算播放量", "歌手名", 20000, "'陈亿训'", "'孤勇者'"]""", - "sql": """select 歌曲名 from 歌曲库 where 结算播放量 > 20000 and 歌手名 = '陈亿训' and 歌曲名 = '孤勇者' and 数据日期 = '2023-09-18' """, - }, - { - "current_date": "2023-09-18", - "table_name": "歌曲库", - "fields_list": """["歌曲名", "歌曲版本", "歌手名", "歌曲类型", 
"发布时间", "MPPM歌曲ID", "是否严选窄口径歌曲", "是否严选宽口径歌曲", "是否潮流人歌曲", "超声波歌曲ID", "C音歌曲ID", "C音歌曲MID", "结算播放量", "运营播放量", "分享量", "收藏量", "运营搜播量", "结算搜播量", "拉新用户数", "拉活用户数", "分享率", "结算播放份额", "数据日期"]""", - "question": "周洁轮去年发布的歌曲有哪些", - "prior_schema_links": """['23109'->MPPM歌手ID]""", +“孤勇者有哪些”,所以我们需要column=[歌曲名], cell values = ['孤勇者'],所以有[歌曲名:('孤勇者')] +”播放量大于20k的“,所以我们需要column=[结算播放量], cell values = [20000],所以有[结算播放量:(20000)] +”陈亿训唱的“,所以我们需要column=[歌手名], cell values = ['陈亿训'],所以有[歌手名:('陈亿训')]""", + "schema_links":"""["歌曲名":("'孤勇者'"), "结算播放量":(20000), "歌手名":("'陈亿训'")]""", + "sql":"""select 歌曲名 from 歌曲库 where 结算播放量 > 20000 and 歌手名 = '陈亿训' and 歌曲名 = '孤勇者'""" + }, + { "current_date":"2023-09-18", + "table_name":"歌曲库", + "fields_list":"""["歌曲名", "歌曲版本", "歌手名", "歌曲类型", "发布时间", "MPPM歌曲ID", "是否严选窄口径歌曲", "是否严选宽口径歌曲", "是否潮流人歌曲", "超声波歌曲ID", "C音歌曲ID", "C音歌曲MID", "结算播放量", "运营播放量", "分享量", "收藏量", "运营搜播量", "结算搜播量", "拉新用户数", "拉活用户数", "分享率", "结算播放份额", "数据日期"]""", + "question":"周洁轮去年发布的歌曲有哪些", + "prior_schema_links":"""['23109'->MPPM歌手ID]""", "analysis": """让我们一步一步地思考。在问题“周洁轮去年发布的歌曲有哪些“中,我们被问: “歌曲有哪些”,所以我们需要column=[歌曲名] -”去年发布的“,所以我们需要column=[发布时间] -”周洁轮“,所以我们需要column=[歌手名] -基于table和columns,可能的cell values 是 = ['周洁轮', 1]。""", - "schema_links": """["歌曲名", "发布时间", "歌手名", 1, "'周洁轮'"]""", - "sql": """select 歌曲名 from 歌曲库 where datediff('year', 发布时间, '2023-09-18') <= 1 and 歌手名 = '周洁轮' and 数据日期 = '2023-09-18' """, - }, - { - "current_date": "2023-09-11", - "table_name": "艺人库", - "fields_list": """["播放量层级", "播放量单调性", "播放量方差", "播放量突增类型", "播放量集中度", "歌手名", "歌手等级", "歌手类型", "歌手来源", "签约日期", "MPPM潮流人等级", "结算播放量", "运营播放量", "历史累计结算播放量", "有播放量歌曲数", "历史累计运营播放量", "付费用户结算播放量", "结算播放量占比", "运营播放份额", "免费用户结算播放占比", "完播量", "数据日期"]""", - "question": "我想要近半年签约的播放量前十的歌手有哪些", - "prior_schema_links": """[]""", +”去年发布的“,所以我们需要column=[发布时间], cell values = [1],所以有[发布时间:(1)] +”周洁轮“,所以我们需要column=[歌手名], cell values = ['周洁轮'],所以有[歌手名:('周洁轮')]""", + "schema_links":"""["歌曲名", "发布时间":(1), "歌手名":("'周洁轮'")]""", + "sql":"""select 歌曲名 from 歌曲库 
where datediff('year', 发布时间, '2023-09-18') <= 1 and 歌手名 = '周洁轮'""" + }, + { "current_date":"2023-09-11", + "table_name":"艺人库", + "fields_list":"""["播放量层级", "播放量单调性", "播放量方差", "播放量突增类型", "播放量集中度", "歌手名", "歌手等级", "歌手类型", "歌手来源", "签约日期", "MPPM潮流人等级", "结算播放量", "运营播放量", "历史累计结算播放量", "有播放量歌曲数", "历史累计运营播放量", "付费用户结算播放量", "结算播放量占比", "运营播放份额", "免费用户结算播放占比", "完播量", "数据日期"]""", + "question":"我想要近半年签约的播放量前十的歌手有哪些", + "prior_schema_links":"""[]""", "analysis": """让我们一步一步地思考。在问题“我想要近半年签约的播放量前十的歌手“中,我们被问: “歌手有哪些”,所以我们需要column=[歌手名] -”播放量前十的“,所以我们需要column=[结算播放量] -”近半年签约的“,所以我们需要column=[签约日期] -基于table和columns,可能的cell values 是 = [0.5, 10]。""", - "schema_links": """["歌手名", "结算播放量", "签约日期", 0.5, 10]""", - "sql": """select 歌手名 from 艺人库 where datediff('year', 签约日期, '2023-09-11') <= 0.5 and 数据日期 = '2023-09-11' order by 结算播放量 desc limit 10""", - }, - { - "current_date": "2023-08-12", - "table_name": "歌曲库", +”播放量前十的“,所以我们需要column=[结算播放量], cell values = [10],所以有[结算播放量:(10)] +”近半年签约的“,所以我们需要column=[签约日期], cell values = [0.5],所以有[签约日期:(0.5)]""", + "schema_links":"""["歌手名", "结算播放量":(10), "签约日期":(0.5)]""", + "sql":"""select 歌手名 from 艺人库 where datediff('year', 签约日期, '2023-09-11') <= 0.5 order by 结算播放量 desc limit 10""" + }, + { "current_date":"2023-08-12", + "table_name":"歌曲库", "fields_list": """["发行日期", "歌曲语言", "歌曲来源", "歌曲流派", "歌曲名", "歌曲版本", "歌曲类型", "发行时间", "数据日期"]""", - "question": "最近一年发行的歌曲中,有哪些在近7天播放超过一千万的", - "prior_schema_links": """[]""", + "question":"最近一年发行的歌曲中,有哪些在近7天播放超过一千万的", + "prior_schema_links":"""[]""", "analysis": """让我们一步一步地思考。在问题“最近一年发行的歌曲中,有哪些在近7天播放超过一千万的“中,我们被问: “发行的歌曲中,有哪些”,所以我们需要column=[歌曲名] -”最近一年发行的“,所以我们需要column=[发行日期] -”在近7天播放超过一千万的“,所以我们需要column=[数据日期, 结算播放量] -基于table和columns,可能的cell values 是 = [1, 10000000]""", - "schema_links": """["歌曲名", "发行日期", "数据日期", "结算播放量", 1, 10000000]""", - "sql": """select 歌曲名 from 歌曲库 where datediff('year', 发行日期, '2023-08-12') <= 1 and datediff('day', 数据日期, '2023-08-12') <= 7 and 结算播放量 > 10000000""", - }, - { - "current_date": 
"2023-08-12", - "table_name": "歌曲库", +”最近一年发行的“,所以我们需要column=[发行日期], cell values = [1],所以有[发行日期:(1)] +”在近7天播放超过一千万的“,所以我们需要column=[数据日期, 结算播放量], cell values = [7, 10000000],所以有[数据日期:(7), 结算播放量:(10000000)]""", + "schema_links":"""["歌曲名", "发行日期":(1), "数据日期":(7), "结算播放量":(10000000)]""", + "sql":"""select 歌曲名 from 歌曲库 where datediff('year', 发行日期, '2023-08-12') <= 1 and datediff('day', 数据日期, '2023-08-12') <= 7 and 结算播放量 > 10000000""" + }, + { "current_date":"2023-08-12", + "table_name":"歌曲库", "fields_list": """["发行日期", "歌曲语言", "歌曲来源", "歌曲流派", "歌曲名", "歌曲版本", "歌曲类型", "发行时间", "数据日期"]""", - "question": "今年以来发行的歌曲中,有哪些在近7天播放超过一千万的", - "prior_schema_links": """[]""", + "question":"今年以来发行的歌曲中,有哪些在近7天播放超过一千万的", + "prior_schema_links":"""[]""", "analysis": """让我们一步一步地思考。在问题“今年以来发行的歌曲中,有哪些在近7天播放超过一千万的“中,我们被问: “发行的歌曲中,有哪些”,所以我们需要column=[歌曲名] -”今年以来发行的“,所以我们需要column=[发行日期] -”在近7天播放超过一千万的“,所以我们需要column=[数据日期, 结算播放量] -基于table和columns,可能的cell values 是 = [0, 7, 10000000]""", - "schema_links": """["歌曲名", "发行日期", "数据日期", "结算播放量", 0, 7, 10000000]""", - "sql": """select 歌曲名 from 歌曲库 where datediff('year', 发行日期, '2023-08-12') <= 0 and datediff('day', 数据日期, '2023-08-12') <= 7 and 结算播放量 > 10000000""", - }, - { - "current_date": "2023-08-12", - "table_name": "歌曲库", +”今年以来发行的“,所以我们需要column=[发行日期], cell values = [0],所以有[发行日期:(0)] +”在近7天播放超过一千万的“,所以我们需要column=[数据日期, 结算播放量], cell values = [7, 10000000],所以有[数据日期:(7), 结算播放量:(10000000)]""", + "schema_links":"""["歌曲名", "发行日期":(0), "数据日期":(7), "结算播放量":(10000000)]""", + "sql":"""select 歌曲名 from 歌曲库 where datediff('year', 发行日期, '2023-08-12') <= 0 and datediff('day', 数据日期, '2023-08-12') <= 7 and 结算播放量 > 10000000""" + }, + { "current_date":"2023-08-12", + "table_name":"歌曲库", "fields_list": """["发行日期", "歌曲语言", "歌曲来源", "歌曲流派", "歌曲名", "歌曲版本", "歌曲类型", "发行时间", "数据日期"]""", - "question": "2023年以来发行的歌曲中,有哪些在近7天播放超过一千万的", - "prior_schema_links": """['514129144'->MPPM歌曲ID]""", + "question":"2023年以来发行的歌曲中,有哪些在近7天播放超过一千万的", + 
"prior_schema_links":"""['514129144'->MPPM歌曲ID]""", "analysis": """让我们一步一步地思考。在问题“2023年以来发行的歌曲中,有哪些在近7天播放超过一千万的“中,我们被问: “发行的歌曲中,有哪些”,所以我们需要column=[歌曲名] -”2023年以来发行的“,所以我们需要column=[发行日期] -”在近7天播放超过一千万的“,所以我们需要column=[数据日期, 结算播放量] -基于table和columns,可能的cell values 是 = [2023, 7, 10000000]""", - "schema_links": """["歌曲名", "发行日期", "数据日期", "结算播放量", 2023, 7, 10000000]""", - "sql": """select 歌曲名 from 歌曲库 where YEAR(发行日期) >= 2023 and datediff('day', 数据日期, '2023-08-12') <= 7 and 结算播放量 > 10000000""", - }, - { - "current_date": "2023-08-01", - "table_name": "歌曲库", - "fields_list": """["歌曲名", "歌曲版本", "歌手名", "歌曲类型", "发布时间", "MPPM歌曲ID", "是否严选窄口径歌曲", "是否严选宽口径歌曲", "是否潮流人歌曲", "超声波歌曲ID", "C音歌曲ID", "C音歌曲MID", "结算播放量", "运营播放量", "分享量", "收藏量", "运营搜播量", "结算搜播量", "拉新用户数", "拉活用户数", "分享率", "结算播放份额", "数据日期"]""", - "question": "周洁轮2023年6月之后发布的歌曲有哪些", - "prior_schema_links": """['23109'->MPPM歌手ID]""", +”2023年以来发行的“,所以我们需要column=[发行日期], cell values = ['2023-01-01'],所以有[发行日期:('2023-01-01')] +”在近7天播放超过一千万的“,所以我们需要column=[数据日期, 结算播放量], cell values = [7, 10000000],所以有[数据日期:(7), 结算播放量:(10000000)]""", + "schema_links":"""["歌曲名", "发行日期":("'2023-01-01'"), "数据日期":(7), "结算播放量":(10000000)]""", + "sql":"""select 歌曲名 from 歌曲库 where 发行日期 >= '2023-01-01' and datediff('day', 数据日期, '2023-08-12') <= 7 and 结算播放量 > 10000000""" + }, + { "current_date":"2023-08-01", + "table_name":"歌曲库", + "fields_list":"""["歌曲名", "歌曲版本", "歌手名", "歌曲类型", "发布时间", "MPPM歌曲ID", "是否严选窄口径歌曲", "是否严选宽口径歌曲", "是否潮流人歌曲", "超声波歌曲ID", "C音歌曲ID", "C音歌曲MID", "结算播放量", "运营播放量", "分享量", "收藏量", "运营搜播量", "结算搜播量", "拉新用户数", "拉活用户数", "分享率", "结算播放份额", "数据日期"]""", + "question":"周洁轮2023年6月之后发布的歌曲有哪些", + "prior_schema_links":"""['23109'->MPPM歌手ID]""", "analysis": """让我们一步一步地思考。在问题“周洁轮2023年6月之后发布的歌曲有哪些“中,我们被问: “歌曲有哪些”,所以我们需要column=[歌曲名] -”2023年6月之后发布的“,所以我们需要column=[发布时间] -”周洁轮“,所以我们需要column=[歌手名] -基于table和columns,可能的cell values 是 = ['周洁轮', 2023, 6]。""", - "schema_links": """["歌曲名", "发布时间", "歌手名", "周洁轮", 2023, 6]""", - "sql": """select 歌曲名 from 歌曲库 where YEAR(发布时间) 
>= 2023 and MONTH(发布时间) >= 6 and 歌手名 = '周洁轮' and 数据日期 = '2023-08-01' """, - }, - { - "current_date": "2023-08-01", - "table_name": "歌曲库", - "fields_list": """["歌曲名", "歌曲版本", "歌手名", "歌曲类型", "发布时间", "MPPM歌曲ID", "是否严选窄口径歌曲", "是否严选宽口径歌曲", "是否潮流人歌曲", "超声波歌曲ID", "C音歌曲ID", "C音歌曲MID", "结算播放量", "运营播放量", "分享量", "收藏量", "运营搜播量", "结算搜播量", "拉新用户数", "拉活用户数", "分享率", "结算播放份额", "数据日期"]""", - "question": "邓梓琦在2023年1月5日之后发布的歌曲中,有哪些播放量大于500W的?", - "prior_schema_links": """['2312311'->MPPM歌手ID]""", +”2023年6月之后发布的“,所以我们需要column=[发布时间], cell values = ['2023-06-01'],所以有[发布时间:('2023-06-01')] +”周洁轮“,所以我们需要column=[歌手名], cell values = ['周洁轮'],所以有[歌手名:('周洁轮')]""", + "schema_links":"""["歌曲名", "发布时间":("'2023-06-01'"), "歌手名":("'周洁轮'")]""", + "sql":"""select 歌曲名 from 歌曲库 where 发布时间 >= '2023-06-01' and 歌手名 = '周洁轮'""" + }, + { "current_date":"2023-08-01", + "table_name":"歌曲库", + "fields_list":"""["歌曲名", "歌曲版本", "歌手名", "歌曲类型", "发布时间", "MPPM歌曲ID", "是否严选窄口径歌曲", "是否严选宽口径歌曲", "是否潮流人歌曲", "超声波歌曲ID", "C音歌曲ID", "C音歌曲MID", "结算播放量", "运营播放量", "分享量", "收藏量", "运营搜播量", "结算搜播量", "拉新用户数", "拉活用户数", "分享率", "结算播放份额", "数据日期"]""", + "question":"邓梓琦在2023年1月5日之后发布的歌曲中,有哪些播放量大于500W的?", + "prior_schema_links":"""['2312311'->MPPM歌手ID]""", "analysis": """让我们一步一步地思考。在问题“邓梓琦在2023年1月5日之后发布的歌曲中,有哪些播放量大于500W的?“中,我们被问: -“播放量大于500W的”,所以我们需要column=[结算播放量] -”邓梓琦在2023年1月5日之后发布的“,所以我们需要column=[发布时间] -”邓梓琦“,所以我们需要column=[歌手名] -基于table和columns,可能的cell values 是 = ['邓梓琦', 2023, 1, 5, 5000000]。""", - "schema_links": """["结算播放量", "发布时间", "歌手名", "邓梓琦", 2023, 1, 5, 5000000]""", - "sql": """select 歌曲名 from 歌曲库 where YEAR(发布时间) >= 2023 and MONTH(发布时间) >= 1 and DAY(发布时间) >= 5 and 歌手名 = '邓梓琦' and 结算播放量 > 5000000 and 数据日期 = '2023-08-01'""", - }, - { - "current_date": "2023-09-17", - "table_name": "歌曲库", - "fields_list": """["歌曲名", "歌曲版本", "歌手名", "歌曲类型", "发布时间", "MPPM歌曲ID", "是否严选窄口径歌曲", "是否严选宽口径歌曲", "是否潮流人歌曲", "超声波歌曲ID", "C音歌曲ID", "C音歌曲MID", "结算播放量", "运营播放量", "分享量", "收藏量", "运营搜播量", "结算搜播量", "拉新用户数", "拉活用户数", "分享率", "结算播放份额", "数据日期"]""", - "question": 
"2023年6月以后,张亮英播放量大于200万的歌曲有哪些?", - "prior_schema_links": """['45453'->MPPM歌手ID]""", +“播放量大于500W的”,所以我们需要column=[结算播放量], cell values = [5000000],所以有[结算播放量:(5000000)] +”邓梓琦在2023年1月5日之后发布的“,所以我们需要column=[发布时间], cell values = ['2023-01-05'],所以有[发布时间:('2023-01-05')] +”邓梓琦“,所以我们需要column=[歌手名], cell values = ['邓梓琦'],所以有[歌手名:('邓梓琦')]""", + "schema_links":"""["结算播放量":(5000000), "发布时间":("'2023-01-05'"), "歌手名":("'邓梓琦'")]""", + "sql":"""select 歌曲名 from 歌曲库 where 发布时间 >= '2023-01-05' and 歌手名 = '邓梓琦' and 结算播放量 > 5000000""" + }, + { "current_date":"2023-09-17", + "table_name":"歌曲库", + "fields_list":"""["歌曲名", "歌曲版本", "歌手名", "歌曲类型", "发布时间", "MPPM歌曲ID", "是否严选窄口径歌曲", "是否严选宽口径歌曲", "是否潮流人歌曲", "超声波歌曲ID", "C音歌曲ID", "C音歌曲MID", "结算播放量", "运营播放量", "分享量", "收藏量", "运营搜播量", "结算搜播量", "拉新用户数", "拉活用户数", "分享率", "结算播放份额", "数据日期"]""", + "question":"2023年6月以后,张亮英播放量大于200万的歌曲有哪些?", + "prior_schema_links":"""['45453'->MPPM歌手ID]""", "analysis": """让我们一步一步地思考。在问题“2023年6月以后,张亮英播放量大于200万的歌曲有哪些?“中,我们被问: -“播放量大于200万的”,所以我们需要column=[结算播放量] -”2023年6月以后,张亮英“,所以我们需要column=[数据日期, 歌手名] -”歌曲有哪些“,所以我们需要column=[歌曲名] -基于table和columns,可能的cell values 是 = ['张亮英', 2023, 6, 2000000]。""", - "schema_links": """["结算播放量", "数据日期", "歌手名", "张亮英", 2023, 6, 2000000]""", - "sql": """select 歌曲名 from 歌曲库 where YEAR(数据日期) >= 2023 and MONTH(数据日期) >= 6 and 歌手名 = '张亮英' and 结算播放量 > 2000000 """, - }, - { - "current_date": "2023-08-16", - "table_name": "歌曲库", - "fields_list": """["歌曲名", "歌曲版本", "歌手名", "歌曲类型", "发布时间", "MPPM歌曲ID", "是否严选窄口径歌曲", "是否严选宽口径歌曲", "是否潮流人歌曲", "超声波歌曲ID", "C音歌曲ID", "C音歌曲MID", "结算播放量", "运营播放量", "分享量", "收藏量", "运营搜播量", "结算搜播量", "拉新用户数", "拉活用户数", "分享率", "结算播放份额", "数据日期"]""", - "question": "2021年6月以后发布的李雨纯的播放量大于20万的歌曲有哪些", - "prior_schema_links": """['23109'->MPPM歌手ID]""", +“播放量大于200万的”,所以我们需要column=[结算播放量], cell values = [2000000],所以有[结算播放量:(2000000)] +”2023年6月以后,张亮英“,所以我们需要column=[数据日期, 歌手名], cell values = ['2023-06-01', '张亮英'],所以有[数据日期:('2023-06-01'), 歌手名:('张亮英')], +”歌曲有哪些“,所以我们需要column=[歌曲名]""", + 
"schema_links":"""["结算播放量":(2000000), "数据日期":("'2023-06-01'"), "歌手名":("'张亮英'"), "歌曲名"]""", + "sql":"""select 歌曲名 from 歌曲库 where 数据日期 >= '2023-06-01' and 歌手名 = '张亮英' and 结算播放量 > 2000000""" + }, + { "current_date":"2023-08-16", + "table_name":"歌曲库", + "fields_list":"""["歌曲名", "歌曲版本", "歌手名", "歌曲类型", "发布时间", "MPPM歌曲ID", "是否严选窄口径歌曲", "是否严选宽口径歌曲", "是否潮流人歌曲", "超声波歌曲ID", "C音歌曲ID", "C音歌曲MID", "结算播放量", "运营播放量", "分享量", "收藏量", "运营搜播量", "结算搜播量", "拉新用户数", "拉活用户数", "分享率", "结算播放份额", "数据日期"]""", + "question":"2021年6月以后发布的李雨纯的播放量大于20万的歌曲有哪些", + "prior_schema_links":"""['23109'->MPPM歌手ID]""", "analysis": """让我们一步一步地思考。在问题“2021年6月以后发布的李雨纯的播放量大于20万的歌曲有哪些“中,我们被问: -“播放量大于20万的”,所以我们需要column=[结算播放量] -”2021年6月以后发布的“,所以我们需要column=[发布时间] -”李雨纯“,所以我们需要column=[歌手名] -基于table和columns,可能的cell values 是 = ['李雨纯', 2021, 6, 200000]。""", - "schema_links": """["结算播放量", "发布时间", "歌手名", "李雨纯", 2021, 6, 200000]""", - "sql": """select 歌曲名 from 歌曲库 where YEAR(发布时间) >= 2021 and MONTH(发布时间) >= 6 and 歌手名 = '李雨纯' and 结算播放量 > 200000 and 数据日期 = '2023-08-16'""", - }, - { - "current_date": "2023-08-16", - "table_name": "歌曲库", - "fields_list": """["歌曲名", "歌曲版本", "歌手名", "歌曲类型", "发布时间", "MPPM歌曲ID", "是否严选窄口径歌曲", "是否严选宽口径歌曲", "是否潮流人歌曲", "超声波歌曲ID", "C音歌曲ID", "C音歌曲MID", "结算播放量", "运营播放量", "分享量", "收藏量", "运营搜播量", "结算搜播量", "拉新用户数", "拉活用户数", "分享率", "结算播放份额", "数据日期"]""", - "question": "刘锝桦在1992年4月2日到2020年5月2日之间发布的播放量大于20万的歌曲有哪些", - "prior_schema_links": """['4234234'->MPPM歌手ID]""", +“播放量大于20万的”,所以我们需要column=[结算播放量], cell values = [200000],所以有[结算播放量:(200000)] +”2021年6月以后发布的“,所以我们需要column=[发布时间], cell values = ['2021-06-01'],所以有[发布时间:('2021-06-01')] +”李雨纯“,所以我们需要column=[歌手名], cell values = ['李雨纯'],所以有[歌手名:('李雨纯')]""", + "schema_links":"""["结算播放量":(200000), "发布时间":("'2021-06-01'"), "歌手名":("'李雨纯'")]""", + "sql":"""select 歌曲名 from 歌曲库 where 发布时间 >= '2021-06-01' and 歌手名 = '李雨纯' and 结算播放量 > 200000""" + }, + { "current_date":"2023-08-16", + "table_name":"歌曲库", + "fields_list":"""["歌曲名", "歌曲版本", "歌手名", "歌曲类型", "发布时间", "MPPM歌曲ID", 
"是否严选窄口径歌曲", "是否严选宽口径歌曲", "是否潮流人歌曲", "超声波歌曲ID", "C音歌曲ID", "C音歌曲MID", "结算播放量", "运营播放量", "分享量", "收藏量", "运营搜播量", "结算搜播量", "拉新用户数", "拉活用户数", "分享率", "结算播放份额", "数据日期"]""", + "question":"刘锝桦在1992年4月2日到2020年5月2日之间发布的播放量大于20万的歌曲有哪些", + "prior_schema_links":"""['4234234'->MPPM歌手ID]""", "analysis": """让我们一步一步地思考。在问题“刘锝桦在1992年4月2日到2020年5月2日之间发布的播放量大于20万的歌曲有哪些“中,我们被问: -“播放量大于20万的”,所以我们需要column=[结算播放量] -”1992年4月2日到2020年5月2日之间发布的“,所以我们需要column=[发布时间] -”刘锝桦“,所以我们需要column=[歌手名] -基于table和columns,可能的cell values 是 = ['刘锝桦', 1992, 4, 2, 2020, 5, 2, 200000]。""", - "schema_links": """["结算播放量", "发布时间", "歌手名", "刘锝桦", 1992, 4, 2, 2020, 5, 2, 200000]""", - "sql": """select 歌曲名 from 歌曲库 where YEAR(发布时间) >= 1992 and MONTH(发布时间) >= 4 and DAY(发布时间) >= 2 and YEAR(发布时间) <= 2020 and MONTH(发布时间) <= 5 and DAY(发布时间) <= 2 and 歌手名 = '刘锝桦' and 结算播放量 > 200000 and 数据日期 = '2023-08-16'""", - }, +“播放量大于20万的”,所以我们需要column=[结算播放量], cell values = [200000],所以有[结算播放量:(200000)] +”1992年4月2日到2020年5月2日之间发布的“, 所以我们需要column=[发布时间], cell values = ['1992-04-02', '2020-05-02'],所以有[发布时间:('1992-04-02', '2020-05-02')] +”刘锝桦“,所以我们需要column=[歌手名], cell values = ['刘锝桦'],所以有[歌手名:('刘锝桦')]""", + "schema_links":"""["结算播放量":(200000), "发布时间":("'1992-04-02'", "'2020-05-02'"), "歌手名":("'刘锝桦'")]""", + "sql":"""select 歌曲名 from 歌曲库 where 发布时间 >= '1992-04-02' and 发布时间 <= '2020-05-02' and 歌手名 = '刘锝桦' and 结算播放量 > 200000""" + }, { - "current_date": "2023-09-04", - "table_name": "内容库产品", - "fields_list": """["用户名", "部门", "模块", "访问时长", "访问次数", "访问人数", "数据日期"]""", - "question": "内容库近30天访问次数的平均数", - "prior_schema_links": """[]""", + "current_date":"2023-09-04", + "table_name":"内容库产品", + "fields_list":"""["用户名", "部门", "模块", "访问时长", "访问次数", "访问人数", "数据日期"]""", + "question":"内容库近30天访问次数的平均数", + "prior_schema_links":"""[]""", "analysis": """让我们一步一步地思考。在问题“内容库近30天访问次数的平均数“中,我们被问: “访问次数的平均数”,所以我们需要column=[访问次数] -”内容库近30天“,所以我们需要column=[数据日期] -基于table和columns,可能的cell values 是 = [30]。""", - "schema_links": """["访问次数", "数据日期", 30]""", - "sql": """select 
avg(访问次数) from 内容库产品 where datediff('day', 数据日期, '2023-09-04') <= 30 """, - }, +”内容库近30天“,所以我们需要column=[数据日期], cell values = [30],所以有[数据日期:(30)]""", + "schema_links":"""["访问次数", "数据日期":(30)]""", + "sql":"""select avg(访问次数) from 内容库产品 where datediff('day', 数据日期, '2023-09-04') <= 30 """ + }, { - "current_date": "2023-09-04", - "table_name": "内容库产品", - "fields_list": """["用户名", "部门", "模块", "访问时长", "访问次数", "访问人数", "数据日期"]""", - "question": "内容库近半年哪个月的访问次数汇总最高", - "prior_schema_links": """[]""", + "current_date":"2023-09-04", + "table_name":"内容库产品", + "fields_list":"""["用户名", "部门", "模块", "访问时长", "访问次数", "访问人数", "数据日期"]""", + "question":"内容库近半年哪个月的访问次数汇总最高", + "prior_schema_links":"""[]""", "analysis": """让我们一步一步地思考。在问题“内容库近半年哪个月的访问次数汇总最高“中,我们被问: -“访问次数汇总最高”,所以我们需要column=[访问次数] -”内容库近半年“,所以我们需要column=[数据日期] -基于table和columns,可能的cell values 是 = [0.5]。""", - "schema_links": """["访问次数", "数据日期", 0.5]""", - "sql": """select MONTH(数据日期), sum(访问次数) from 内容库产品 where datediff('year', 数据日期, '2023-09-04') <= 0.5 group by MONTH(数据日期) order by sum(访问次数) desc limit 1 """, - }, +“访问次数汇总最高”,所以我们需要column=[访问次数], cell values = [1],所以有[访问次数:(1)] +”内容库近半年“,所以我们需要column=[数据日期], cell values = [0.5],所以有[数据日期:(0.5)]""", + "schema_links":"""["访问次数":(1), "数据日期":(0.5)]""", + "sql":"""select MONTH(数据日期), sum(访问次数) from 内容库产品 where datediff('year', 数据日期, '2023-09-04') <= 0.5 group by MONTH(数据日期) order by sum(访问次数) desc limit 1""" + }, { - "current_date": "2023-09-04", - "table_name": "内容库产品", - "fields_list": """["用户名", "部门", "模块", "访问时长", "访问次数", "访问人数", "数据日期"]""", - "question": "内容库近半年每个月的平均访问次数", - "prior_schema_links": """[]""", + "current_date":"2023-09-04", + "table_name":"内容库产品", + "fields_list":"""["用户名", "部门", "模块", "访问时长", "访问次数", "访问人数", "数据日期"]""", + "question":"内容库近半年每个月的平均访问次数", + "prior_schema_links":"""[]""", "analysis": """让我们一步一步地思考。在问题“内容库近半年每个月的平均访问次数“中,我们被问: “每个月的平均访问次数”,所以我们需要column=[访问次数] -”内容库近半年“,所以我们需要column=[数据日期] -基于table和columns,可能的cell values 是 = [0.5]。""", - 
"schema_links": """["访问次数", "数据日期", 0.5]""", - "sql": """select MONTH(数据日期), avg(访问次数) from 内容库产品 where datediff('year', 数据日期, '2023-09-04') <= 0.5 group by MONTH(数据日期) """, - }, +”内容库近半年“,所以我们需要column=[数据日期], cell values = [0.5],所以有[数据日期:(0.5)]""", + "schema_links":"""["访问次数", "数据日期":(0.5)]""", + "sql":"""select MONTH(数据日期), avg(访问次数) from 内容库产品 where datediff('year', 数据日期, '2023-09-04') <= 0.5 group by MONTH(数据日期)""" + }, { - "current_date": "2023-09-10", - "table_name": "内容库产品", - "fields_list": """["用户名", "部门", "模块", "访问时长", "访问次数", "访问人数", "数据日期"]""", - "question": "内容库 按部门统计访问次数 top10 的部门", - "prior_schema_links": """[]""", + "current_date":"2023-09-10", + "table_name":"内容库产品", + "fields_list":"""["用户名", "部门", "模块", "访问时长", "访问次数", "访问人数", "数据日期"]""", + "question":"内容库 按部门统计访问次数 top10 的部门", + "prior_schema_links":"""[]""", "analysis": """让我们一步一步地思考。在问题“内容库 按部门统计访问次数 top10 的部门“中,我们被问: -“访问次数 top10 的部门”,所以我们需要column=[访问次数] -”内容库 按部门统计“,所以我们需要column=[部门] -基于table和columns,可能的cell values 是 = [10]。""", - "schema_links": """["访问次数", "部门", 10]""", - "sql": """select 部门, sum(访问次数) from 内容库产品 group by 部门 order by sum(访问次数) desc limit 10 """, - }, -] +“访问次数 top10 的部门”,所以我们需要column=[访问次数], cell values = [10],所以有[访问次数:(10)] +”内容库 按部门统计“,所以我们需要column=[部门]""", + "schema_links":"""["访问次数":(10), "部门"]""", + "sql":"""select 部门, sum(访问次数) from 内容库产品 group by 部门 order by sum(访问次数) desc limit 10""" + }, + { + "current_date":"2023-09-10", + "table_name":"内容库产品", + "fields_list":"""["用户名", "部门", "模块", "访问时长", "访问次数", "访问人数", "数据日期"]""", + "question":"超音速 近7个月,月度总访问量超过 2万的月份", + "prior_schema_links":"""[]""", + "analysis": """让我们一步一步地思考。在问题“超音速 近7个月,月度总访问量超过 2万的月份“中,我们被问: +“月度总访问量超过 2万的月份”,所以我们需要column=[访问次数], cell values = [20000],所以有[访问次数:(20000)] +”超音速 近7个月“,所以我们需要column=[数据日期], cell values = [7],所以有[数据日期:(7)]""", + "schema_links":"""["访问次数":(20000), "数据日期":(7)]""", + "sql":"""select MONTH(数据日期) from 内容库产品 where datediff('day', 数据日期, '2023-09-10') <= 7 group by MONTH(数据日期) having 
sum(访问次数) > 20000""" + }, + { + "current_date":"2023-09-10", + "table_name":"歌曲库", + "fields_list":"""["歌曲语言", "歌曲来源", "运营播放量", "播放量", "歌曲名", "结算播放量", "专辑名", "发布日期", "歌曲版本", "歌曲类型", "数据日期"]""", + "question":"2022年7月到2023年7月之间发布到歌曲,按播放量取top 100,再按月粒度来统计近1年的运营播放量", + "prior_schema_links":"""[]""", + "analysis": """让我们一步一步地思考。在问题“2022年7月到2023年7月之间发布到歌曲,按播放量取top 100,再按月粒度来统计近1年的运营播放量“中,我们被问: +“按月粒度来统计近1年的运营播放量”,所以我们需要column=[运营播放量, 数据日期], cell values = [1],所以有[运营播放量, 数据日期:(1)] +”按播放量取top 100“,所以我们需要column=[播放量], cell values = [100],所以有[播放量:(100)] +“2022年7月到2023年7月之间发布到歌曲”,所以我们需要column=[发布日期], cell values = ['2022-07-01', '2023-07-01'],所以有[发布日期:('2022-07-01', '2023-07-01')]""", + "schema_links":"""["运营播放量", "数据日期":(1), "播放量":(100), "发布日期":("'2022-07-01'", "'2023-07-01'")]""", + "sql":"""select MONTH(数据日期), sum(运营播放量) from (select 数据日期, 运营播放量 from 歌曲库 where 发布日期 >= '2022-07-01' and 发布日期 <= '2023-07-01' order by 播放量 desc limit 100) t where datediff('year', 数据日期, '2023-09-10') <= 1 group by MONTH(数据日期)""" + }, + { + "current_date":"2023-09-10", + "table_name":"歌曲库", + "fields_list":"""["歌曲语言", "歌曲来源", "运营播放量", "播放量", "歌曲名", "结算播放量", "专辑名", "发布日期", "歌曲版本", "歌曲类型", "数据日期"]""", + "question":"2022年7月到2023年7月之间发布到歌曲,按播放量取top100,再按月粒度来统计近1年的运营播放量之和,筛选出其中运营播放量之和大于2k的月份", + "prior_schema_links":"""[]""", + "analysis": """让我们一步一步地思考。在问题“2022年7月到2023年7月之间发布到歌曲,按播放量取top100,再按月粒度来统计近1年的运营播放量之和,筛选出其中运营播放量之和大于2k的月份“中,我们被问: +“筛选出其中运营播放量之和大于2k的月份”,所以我们需要column=[运营播放量], cell values = [2000],所以有[运营播放量:(2000)] +”按月粒度来统计近1年的运营播放量之和“,所以我们需要column=[数据日期], cell values = [1],所以有[数据日期:(1)] +”按播放量取top100“,所以我们需要column=[播放量], cell values = [100],所以有[播放量:(100)] +”2022年7月到2023年7月之间发布到歌曲“,所以我们需要column=[发布日期], cell values = ['2022-07-01', '2023-07-01'],所以有[发布日期:('2022-07-01', '2023-07-01')]""", + "schema_links":"""["运营播放量":(2000), "数据日期":(1), "播放量":(100), "发布日期":("'2022-07-01'", "'2023-07-01'")]""", + "sql":"""select MONTH(数据日期), sum(运营播放量) from (select 数据日期, 运营播放量 from 歌曲库 where 发布日期 >= '2022-07-01' 
and 发布日期 <= '2023-07-01' order by 播放量 desc limit 100) t where datediff('year', 数据日期, '2023-09-10') <= 1 group by MONTH(数据日期) having sum(运营播放量) > 2000""" + } +] \ No newline at end of file diff --git a/chat/core/src/main/python/run_config.py b/chat/core/src/main/python/run_config.py index f49a0dfeb..b8a51176e 100644 --- a/chat/core/src/main/python/run_config.py +++ b/chat/core/src/main/python/run_config.py @@ -14,6 +14,7 @@ TEMPERATURE = 0.0 CHROMA_DB_PERSIST_DIR = "chm_db" PRESET_QUERY_COLLECTION_NAME = "preset_query_collection" +SOLVED_QUERY_COLLECTION_NAME = "solved_query_collection" TEXT2DSL_COLLECTION_NAME = "text2dsl_collection" TEXT2DSL_FEW_SHOTS_EXAMPLE_NUM = 15 TEXT2DSL_IS_SHORTCUT = False diff --git a/chat/core/src/main/python/plugin_call/prompt_construct.py b/chat/core/src/main/python/services/plugin_call/prompt_construct.py similarity index 100% rename from chat/core/src/main/python/plugin_call/prompt_construct.py rename to chat/core/src/main/python/services/plugin_call/prompt_construct.py diff --git a/chat/core/src/main/python/plugin_call/run.py b/chat/core/src/main/python/services/plugin_call/run.py similarity index 100% rename from chat/core/src/main/python/plugin_call/run.py rename to chat/core/src/main/python/services/plugin_call/run.py diff --git a/chat/core/src/main/python/preset_retrieval/preset_query_db.py b/chat/core/src/main/python/services/preset_retrieval/preset_query_db.py similarity index 100% rename from chat/core/src/main/python/preset_retrieval/preset_query_db.py rename to chat/core/src/main/python/services/preset_retrieval/preset_query_db.py diff --git a/chat/core/src/main/python/preset_retrieval/run.py b/chat/core/src/main/python/services/preset_retrieval/run.py similarity index 100% rename from chat/core/src/main/python/preset_retrieval/run.py rename to chat/core/src/main/python/services/preset_retrieval/run.py diff --git a/chat/core/src/main/python/services/query_retrieval/run.py 
b/chat/core/src/main/python/services/query_retrieval/run.py new file mode 100644 index 000000000..0a0c82107 --- /dev/null +++ b/chat/core/src/main/python/services/query_retrieval/run.py @@ -0,0 +1,85 @@ +# -*- coding:utf-8 -*- + +import os +import sys +import uuid +from typing import Any, List, Mapping, Optional, Union + +sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) +sys.path.append(os.path.dirname(os.path.abspath(__file__))) + +import chromadb +from chromadb.config import Settings +from chromadb.api import Collection, Documents, Embeddings + +from util.text2vec import Text2VecEmbeddingFunction + +from run_config import SOLVED_QUERY_COLLECTION_NAME, PRESET_QUERY_COLLECTION_NAME +from util.chromadb_instance import (client, + get_chroma_collection_size, query_chroma_collection, + parse_retrieval_chroma_collection_query, chroma_collection_query_retrieval_format, + get_chroma_collection_by_ids, get_chroma_collection_size, + add_chroma_collection, update_chroma_collection, delete_chroma_collection_by_ids, + empty_chroma_collection_2) + +emb_func = Text2VecEmbeddingFunction() + +solved_query_collection = client.get_or_create_collection(name=SOLVED_QUERY_COLLECTION_NAME, + embedding_function=emb_func, + metadata={"hnsw:space": "cosine"} + ) # Get a collection object from an existing collection, by name. If it doesn't exist, create it. 
+print("init_solved_query_collection_size: ", get_chroma_collection_size(solved_query_collection)) + + +preset_query_collection = client.get_or_create_collection(name=PRESET_QUERY_COLLECTION_NAME, + embedding_function=emb_func, + metadata={"hnsw:space": "cosine"} + ) +print("init_preset_query_collection_size: ", get_chroma_collection_size(preset_query_collection)) + +class ChromaCollectionRetriever(object): + def __init__(self, collection:Collection): + self.collection = collection + + def retrieval_query_run(self, query_texts_list:List[str], + filter_condition:Mapping[str,str]=None, n_results:int=5): + + retrieval_res = query_chroma_collection(self.collection, query_texts_list, + filter_condition, n_results) + + parsed_retrieval_res = parse_retrieval_chroma_collection_query(retrieval_res) + parsed_retrieval_res_format = chroma_collection_query_retrieval_format(query_texts_list, parsed_retrieval_res) + + print('parsed_retrieval_res_format: ', parsed_retrieval_res_format) + + return parsed_retrieval_res_format + + def get_query_by_ids(self, query_ids:List[str]): + queries = get_chroma_collection_by_ids(self.collection, query_ids) + return queries + + def get_query_size(self): + return get_chroma_collection_size(self.collection) + + def add_queries(self, query_text_list:List[str], + query_id_list:List[str], metadatas:List[Mapping[str, str]]=None): + add_chroma_collection(self.collection, query_text_list, query_id_list, metadatas) + return True + + def update_queries(self, query_text_list:List[str], + query_id_list:List[str], metadatas:List[Mapping[str, str]]=None): + update_chroma_collection(self.collection, query_text_list, query_id_list, metadatas) + return True + + def delete_queries_by_ids(self, query_ids:List[str]): + delete_chroma_collection_by_ids(self.collection, query_ids) + return True + + def empty_query_collection(self): + self.collection = empty_chroma_collection_2(self.collection) + + return True + + +solved_query_retriever = 
ChromaCollectionRetriever(solved_query_collection) +preset_query_retriever = ChromaCollectionRetriever(preset_query_collection) diff --git a/chat/core/src/main/python/sql/constructor.py b/chat/core/src/main/python/services/sql/constructor.py similarity index 100% rename from chat/core/src/main/python/sql/constructor.py rename to chat/core/src/main/python/services/sql/constructor.py diff --git a/chat/core/src/main/python/sql/examples_reload_run.py b/chat/core/src/main/python/services/sql/examples_reload_run.py similarity index 100% rename from chat/core/src/main/python/sql/examples_reload_run.py rename to chat/core/src/main/python/services/sql/examples_reload_run.py diff --git a/chat/core/src/main/python/sql/output_parser.py b/chat/core/src/main/python/services/sql/output_parser.py similarity index 100% rename from chat/core/src/main/python/sql/output_parser.py rename to chat/core/src/main/python/services/sql/output_parser.py diff --git a/chat/core/src/main/python/sql/prompt_maker.py b/chat/core/src/main/python/services/sql/prompt_maker.py similarity index 100% rename from chat/core/src/main/python/sql/prompt_maker.py rename to chat/core/src/main/python/services/sql/prompt_maker.py diff --git a/chat/core/src/main/python/sql/run.py b/chat/core/src/main/python/services/sql/run.py similarity index 100% rename from chat/core/src/main/python/sql/run.py rename to chat/core/src/main/python/services/sql/run.py diff --git a/chat/core/src/main/python/services_router/plugin_call_service.py b/chat/core/src/main/python/services_router/plugin_call_service.py new file mode 100644 index 000000000..4a393192d --- /dev/null +++ b/chat/core/src/main/python/services_router/plugin_call_service.py @@ -0,0 +1,33 @@ +# -*- coding:utf-8 -*- +import os +import sys +from typing import Any, List, Mapping, Optional, Union + +sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) +sys.path.append(os.path.dirname(os.path.abspath(__file__))) + +from fastapi import APIRouter, 
Depends, HTTPException + +from services.plugin_call.run import plugin_selection_run + + +router = APIRouter() + +@router.post("/plugin_selection/") +async def tool_selection(query_body: Mapping[str, Any]): + if "queryText" not in query_body: + raise HTTPException(status_code=400, detail="queryText is not in query_body") + else: + query_text = query_body["queryText"] + + if "pluginConfigs" not in query_body: + raise HTTPException( + status_code=400, detail="pluginConfigs is not in query_body" + ) + else: + plugin_configs = query_body["pluginConfigs"] + + resp = plugin_selection_run(query_text=query_text, plugin_configs=plugin_configs) + + return resp + diff --git a/chat/core/src/main/python/services_router/preset_query_service.py b/chat/core/src/main/python/services_router/preset_query_service.py new file mode 100644 index 000000000..78afc0f64 --- /dev/null +++ b/chat/core/src/main/python/services_router/preset_query_service.py @@ -0,0 +1,71 @@ +# -*- coding:utf-8 -*- +import os +import sys +from typing import Any, List, Mapping, Optional, Union + +sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) +sys.path.append(os.path.dirname(os.path.abspath(__file__))) + +from fastapi import APIRouter, Depends, HTTPException + +from services.query_retrieval.run import preset_query_retriever + +router = APIRouter() + +@router.post("/preset_query_retrival") +def preset_query_retrival(query_text_list: List[str], n_results: int = 5): + parsed_retrieval_res_format = preset_query_retriever.retrieval_query_run(query_texts_list=query_text_list, filter_condition=None, n_results=n_results) + + return parsed_retrieval_res_format + + +@router.post("/preset_query_add") +def preset_query_add(preset_info_list: List[Mapping[str, str]]): + preset_queries = [] + preset_query_ids = [] + + for preset_info in preset_info_list: + preset_queries.append(preset_info['preset_query']) + preset_query_ids.append(preset_info['preset_query_id']) + +
preset_query_retriever.add_queries(query_text_list=preset_queries, query_id_list=preset_query_ids, metadatas=None) + + return "success" + +@router.post("/preset_query_update") +def preset_query_update(preset_info_list: List[Mapping[str, str]]): + preset_queries = [] + preset_query_ids = [] + + for preset_info in preset_info_list: + preset_queries.append(preset_info['preset_query']) + preset_query_ids.append(preset_info['preset_query_id']) + + preset_query_retriever.update_queries(query_text_list=preset_queries, query_id_list=preset_query_ids, metadatas=None) + + return "success" + + +@router.get("/preset_query_empty") +def preset_query_empty(): + preset_query_retriever.empty_query_collection() + + return "success" + +@router.post("/preset_delete_by_ids") +def preset_delete_by_ids(preset_query_ids: List[str]): + preset_query_retriever.delete_queries_by_ids(preset_query_ids) + + return "success" + +@router.post("/preset_get_by_ids") +def preset_get_by_ids(preset_query_ids: List[str]): + preset_queries = preset_query_retriever.get_query_by_ids(preset_query_ids) + + return preset_queries + +@router.get("/preset_query_size") +def preset_query_size(): + size = preset_query_retriever.get_query_size() + + return size diff --git a/chat/core/src/main/python/services_router/query2sql_service.py b/chat/core/src/main/python/services_router/query2sql_service.py new file mode 100644 index 000000000..556e9f3cc --- /dev/null +++ b/chat/core/src/main/python/services_router/query2sql_service.py @@ -0,0 +1,66 @@ +# -*- coding:utf-8 -*- +import os +import sys +from typing import Any, List, Mapping, Optional, Union + +sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) +sys.path.append(os.path.dirname(os.path.abspath(__file__))) + +from fastapi import APIRouter, Depends, HTTPException + +from services.sql.run import text2sql_agent + +router = APIRouter() + + +@router.post("/query2sql/") +def din_query2sql(query_body: Mapping[str, Any]): + if "queryText" not in 
query_body: + raise HTTPException(status_code=400, detail="queryText is not in query_body") + else: + query_text = query_body["queryText"] + + if "schema" not in query_body: + raise HTTPException(status_code=400, detail="schema is not in query_body") + else: + schema = query_body["schema"] + + if "currentDate" not in query_body: + raise HTTPException(status_code=400, detail="currentDate is not in query_body") + else: + current_date = query_body["currentDate"] + + if "linking" not in query_body: + linking = None + else: + linking = query_body["linking"] + + resp = text2sql_agent.query2sql_run( + query_text=query_text, schema=schema, current_date=current_date, linking=linking + ) + + return resp + + +@router.post("/query2sql_setting_update/") +def query2sql_setting_update(query_body: Mapping[str, Any]): + if "sqlExamplars" not in query_body: + raise HTTPException(status_code=400, detail="sqlExamplars is not in query_body") + else: + sql_examplars = query_body["sqlExamplars"] + + if "exampleNums" not in query_body: + raise HTTPException(status_code=400, detail="exampleNums is not in query_body") + else: + example_nums = query_body["exampleNums"] + + if "isShortcut" not in query_body: + raise HTTPException(status_code=400, detail="isShortcut is not in query_body") + else: + is_shortcut = query_body["isShortcut"] + + text2sql_agent.update_examples( + sql_examples=sql_examplars, example_nums=example_nums, is_shortcut=is_shortcut + ) + + return "success" diff --git a/chat/core/src/main/python/services_router/solved_query_service.py b/chat/core/src/main/python/services_router/solved_query_service.py new file mode 100644 index 000000000..39f7ac74f --- /dev/null +++ b/chat/core/src/main/python/services_router/solved_query_service.py @@ -0,0 +1,80 @@ +# -*- coding:utf-8 -*- +import os +import sys +from typing import Any, List, Mapping, Optional, Union + +sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+sys.path.append(os.path.dirname(os.path.abspath(__file__))) + +from fastapi import APIRouter, Depends, HTTPException + +from services.query_retrieval.run import solved_query_retriever + +router = APIRouter() + +@router.post("/solved_query_retrival") +def solved_query_retrival(query_info: Mapping[str, Any], n_results: int = 5): + query_texts_list = query_info['queryTextsList'] + filter_condition = query_info['filterCondition'] + + parsed_retrieval_res_format = solved_query_retriever.retrieval_query_run(query_texts_list=query_texts_list, + filter_condition=filter_condition, + n_results=n_results) + + return parsed_retrieval_res_format + + +@router.post("/solved_query_add") +def add_solved_queries(sovled_query_info_list: List[Mapping[str, Any]]): + queries = [] + query_ids = [] + metadatas = [] + + for sovled_query_info in sovled_query_info_list: + queries.append(sovled_query_info['query']) + query_ids.append(sovled_query_info['query_id']) + metadatas.append(sovled_query_info['metadata']) + + solved_query_retriever.add_queries(query_text_list=queries, query_id_list=query_ids, metadatas=metadatas) + + return "success" + +@router.post("/solved_query_update") +def solved_query_update(sovled_query_info_list: List[Mapping[str, Any]]): + queries = [] + query_ids = [] + metadatas = [] + + for sovled_query_info in sovled_query_info_list: + queries.append(sovled_query_info['query']) + query_ids.append(sovled_query_info['query_id']) + metadatas.append(sovled_query_info['metadata']) + + solved_query_retriever.update_queries(query_text_list=queries, query_id_list=query_ids, metadatas=metadatas) + + return "success" + + +@router.get("/solved_query_empty") +def solved_query_empty(): + solved_query_retriever.empty_query_collection() + + return "success" + +@router.post("/solved_query_delete_by_ids") +def solved_query_delete_by_ids(query_ids: List[str]): + solved_query_retriever.delete_queries_by_ids(query_ids=query_ids) + + return "success" + 
+@router.post("/solved_query_get_by_ids") +def solved_query_get_by_ids(query_ids: List[str]): + queries = solved_query_retriever.get_query_by_ids(query_ids=query_ids) + + return queries + +@router.get("/solved_query_size") +def solved_query_size(): + size = solved_query_retriever.get_query_size() + + return size diff --git a/chat/core/src/main/python/supersonic_llmparser.py b/chat/core/src/main/python/supersonic_llmparser.py index 42cfe1ce6..66587e7ad 100644 --- a/chat/core/src/main/python/supersonic_llmparser.py +++ b/chat/core/src/main/python/supersonic_llmparser.py @@ -11,177 +11,18 @@ from typing import Any, List, Mapping from fastapi import FastAPI, HTTPException -from sql.run import text2sql_agent +from run_config import LLMPARSER_HOST, LLMPARSER_PORT -from preset_retrieval.run import ( - preset_query_retrieval_run, - collection as preset_query_collection, -) -from preset_retrieval.preset_query_db import ( - add2preset_query_collection, - empty_preset_query_collection, - delete_preset_query_by_ids, - update_preset_query_collection, - get_preset_query_by_ids, - preset_query_collection_size, -) +from services_router import (query2sql_service, preset_query_service, + solved_query_service, plugin_call_service) -from plugin_call.run import plugin_selection_run - -from run_config import LLMPARSER_HOST -from run_config import LLMPARSER_PORT app = FastAPI() - -@app.post("/query2sql/") -async def din_query2sql(query_body: Mapping[str, Any]): - if "queryText" not in query_body: - raise HTTPException(status_code=400, detail="query_text is not in query_body") - else: - query_text = query_body["queryText"] - - if "schema" not in query_body: - raise HTTPException(status_code=400, detail="schema is not in query_body") - else: - schema = query_body["schema"] - - if "currentDate" not in query_body: - raise HTTPException(status_code=400, detail="currentDate is not in query_body") - else: - current_date = query_body["currentDate"] - - if "linking" not in query_body: - linking = 
None - else: - linking = query_body["linking"] - - resp = text2sql_agent.query2sql_run( - query_text=query_text, schema=schema, current_date=current_date, linking=linking - ) - - return resp - - -@app.post("/query2sql_setting_update/") -async def query2sql_setting_update(query_body: Mapping[str, Any]): - if "sqlExamplars" not in query_body: - raise HTTPException(status_code=400, detail="sqlExamplars is not in query_body") - else: - sql_examplars = query_body["sqlExamplars"] - - if "exampleNums" not in query_body: - raise HTTPException(status_code=400, detail="exampleNums is not in query_body") - else: - example_nums = query_body["exampleNums"] - - if "isShortcut" not in query_body: - raise HTTPException(status_code=400, detail="isShortcut is not in query_body") - else: - is_shortcut = query_body["isShortcut"] - - text2sql_agent.update_examples( - sql_examples=sql_examplars, example_nums=example_nums, is_shortcut=is_shortcut - ) - - return "success" - - -@app.post("/preset_query_retrival/") -async def preset_query_retrival(query_text_list: List[str], n_results: int = 5): - parsed_retrieval_res_format = preset_query_retrieval_run( - preset_query_collection, query_text_list, n_results - ) - - return parsed_retrieval_res_format - - -@app.post("/preset_query_add/") -async def preset_query_add(preset_info_list: List[Mapping[str, str]]): - preset_queries = [] - preset_query_ids = [] - - for preset_info in preset_info_list: - preset_queries.append(preset_info["preset_query"]) - preset_query_ids.append(preset_info["preset_query_id"]) - - add2preset_query_collection( - collection=preset_query_collection, - preset_queries=preset_queries, - preset_query_ids=preset_query_ids, - ) - - return "success" - - -@app.post("/preset_query_update/") -async def preset_query_update(preset_info_list: List[Mapping[str, str]]): - preset_queries = [] - preset_query_ids = [] - - for preset_info in preset_info_list: - preset_queries.append(preset_info["preset_query"]) - 
preset_query_ids.append(preset_info["preset_query_id"]) - - update_preset_query_collection( - collection=preset_query_collection, - preset_queries=preset_queries, - preset_query_ids=preset_query_ids, - ) - - return "success" - - -@app.get("/preset_query_empty/") -async def preset_query_empty(): - empty_preset_query_collection(collection=preset_query_collection) - - return "success" - - -@app.post("/preset_delete_by_ids/") -async def preset_delete_by_ids(preset_query_ids: List[str]): - delete_preset_query_by_ids( - collection=preset_query_collection, preset_query_ids=preset_query_ids - ) - - return "success" - - -@app.post("/preset_get_by_ids/") -async def preset_get_by_ids(preset_query_ids: List[str]): - preset_queries = get_preset_query_by_ids( - collection=preset_query_collection, preset_query_ids=preset_query_ids - ) - - return preset_queries - - -@app.get("/preset_query_size/") -async def preset_query_size(): - size = preset_query_collection_size(collection=preset_query_collection) - - return size - - -@app.post("/plugin_selection/") -async def tool_selection(query_body: Mapping[str, Any]): - if "queryText" not in query_body: - raise HTTPException(status_code=400, detail="query_text is not in query_body") - else: - query_text = query_body["queryText"] - - if "pluginConfigs" not in query_body: - raise HTTPException( - status_code=400, detail="pluginConfigs is not in query_body" - ) - else: - plugin_configs = query_body["pluginConfigs"] - - resp = plugin_selection_run(query_text=query_text, plugin_configs=plugin_configs) - - return resp - +app.include_router(preset_query_service.router) +app.include_router(solved_query_service.router) +app.include_router(query2sql_service.router) +app.include_router(plugin_call_service.router) if __name__ == "__main__": uvicorn.run(app, host=LLMPARSER_HOST, port=LLMPARSER_PORT) diff --git a/chat/core/src/main/python/util/chromadb_instance.py b/chat/core/src/main/python/util/chromadb_instance.py index 35dd4fa7b..26a4bffc8 100644 
--- a/chat/core/src/main/python/util/chromadb_instance.py +++ b/chat/core/src/main/python/util/chromadb_instance.py @@ -1,4 +1,5 @@ # -*- coding:utf-8 -*- +from typing import Any, List, Mapping, Optional, Union import chromadb from chromadb.api import Collection @@ -14,7 +15,7 @@ client = chromadb.Client( ) -def empty_chroma_collection_2(collection: Collection): +def empty_chroma_collection_2(collection:Collection): collection_name = collection.name client = collection._client metadata = collection.metadata @@ -22,18 +23,113 @@ def empty_chroma_collection_2(collection: Collection): client.delete_collection(collection_name) - new_collection = client.get_or_create_collection( - name=collection_name, metadata=metadata, embedding_function=embedding_function - ) + new_collection = client.get_or_create_collection(name=collection_name, + metadata=metadata, + embedding_function=embedding_function) size_of_new_collection = new_collection.count() - print( - f"Collection {collection_name} emptied. Size of new collection: {size_of_new_collection}" - ) + print(f'Collection {collection_name} emptied. 
Size of new collection: {size_of_new_collection}') return new_collection -def empty_chroma_collection(collection: Collection): +def empty_chroma_collection(collection:Collection) -> None: collection.delete() + + +def add_chroma_collection(collection:Collection, + queries:List[str], + query_ids:List[str], + metadatas:List[Mapping[str, str]]=None + ) -> None: + + collection.add(documents=queries, + ids=query_ids, + metadatas=metadatas) + + +def update_chroma_collection(collection:Collection, + queries:List[str], + query_ids:List[str], + metadatas:List[Mapping[str, str]]=None + ) -> None: + + collection.update(documents=queries, + ids=query_ids, + metadatas=metadatas) + + +def query_chroma_collection(collection:Collection, query_texts:List[str], + filter_condition:Mapping[str,str]=None, n_results:int=10): + outer_opt = '$and' + inner_opt = '$eq' + + if filter_condition is not None: + if len(filter_condition)==1: + outer_filter = filter_condition + else: + inner_filter = [{_k: {inner_opt:_v}} for _k, _v in filter_condition.items()] + outer_filter = {outer_opt: inner_filter} + else: + outer_filter = None + + print('outer_filter: ', outer_filter) + res = collection.query(query_texts=query_texts, n_results=n_results, where=outer_filter) + return res + + +def parse_retrieval_chroma_collection_query(res:Mapping[str, Any]): + parsed_res = [[] for _ in range(0, len(res['ids']))] + + retrieval_ids = res['ids'] + retrieval_distances = res['distances'] + retrieval_sentences = res['documents'] + retrieval_metadatas = res['metadatas'] + + for query_idx in range(0, len(retrieval_ids)): + id_ls = retrieval_ids[query_idx] + distance_ls = retrieval_distances[query_idx] + sentence_ls = retrieval_sentences[query_idx] + metadata_ls = retrieval_metadatas[query_idx] + + for idx in range(0, len(id_ls)): + id = id_ls[idx] + distance = distance_ls[idx] + sentence = sentence_ls[idx] + metadata = metadata_ls[idx] + + parsed_res[query_idx].append({ + 'id': id, + 'distance': distance, +
'query': sentence, + 'metadata': metadata + }) + + return parsed_res + +def chroma_collection_query_retrieval_format(query_list:List[str], retrieval_list:List[Mapping[str, Any]]): + res = [] + for query_idx in range(0, len(query_list)): + query = query_list[query_idx] + retrieval = retrieval_list[query_idx] + + res.append({ + 'query': query, + 'retrieval': retrieval + }) + + return res + + +def delete_chroma_collection_by_ids(collection:Collection, query_ids:List[str]) -> None: + collection.delete(ids=query_ids) + +def get_chroma_collection_by_ids(collection:Collection, query_ids:List[str]): + res = collection.get(ids=query_ids) + + return res + +def get_chroma_collection_size(collection:Collection) -> int: + return collection.count() +
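
Reviewer sketch (not part of the patch): the filter construction in `query_chroma_collection` and the reshaping done by `parse_retrieval_chroma_collection_query` plus `chroma_collection_query_retrieval_format` can be exercised without a running Chroma client. The block below is a minimal standalone approximation of that logic. The helper names `build_where_filter` and `reshape_query_result`, the `domainId` key, and the fake result dictionary are assumptions made for illustration only; they do not appear in the diff.

```python
from typing import Any, Dict, List, Mapping, Optional


def build_where_filter(
    filter_condition: Optional[Mapping[str, str]],
) -> Optional[Dict[str, Any]]:
    """Mirror the filter branch of query_chroma_collection (hypothetical helper)."""
    if filter_condition is None:
        return None
    if len(filter_condition) == 1:
        # A single key/value pair is passed through unchanged.
        return dict(filter_condition)
    # Any other size is wrapped in an explicit $and of $eq clauses,
    # exactly as the patched function does.
    return {"$and": [{key: {"$eq": value}} for key, value in filter_condition.items()]}


def reshape_query_result(
    query_texts: List[str], res: Mapping[str, List[List[Any]]]
) -> List[Dict[str, Any]]:
    """Turn column-oriented query results into one list of hit records per query text."""
    formatted = []
    for query_idx, query in enumerate(query_texts):
        hits = [
            {"id": id_, "distance": distance, "query": document, "metadata": metadata}
            for id_, distance, document, metadata in zip(
                res["ids"][query_idx],
                res["distances"][query_idx],
                res["documents"][query_idx],
                res["metadatas"][query_idx],
            )
        ]
        formatted.append({"query": query, "retrieval": hits})
    return formatted


if __name__ == "__main__":
    # "domainId" is a made-up metadata key used only for this demonstration.
    print(build_where_filter({"domainId": "1", "status": "solved"}))

    # Hand-written stand-in for the dictionary shape the parser reads.
    fake_result = {
        "ids": [["q1", "q2"]],
        "distances": [[0.1, 0.3]],
        "documents": [["stored query one", "stored query two"]],
        "metadatas": [[{"domainId": "1"}, None]],
    }
    print(reshape_query_result(["incoming query"], fake_result))
```

Keeping the filter builder separate from the `collection.query` call, as sketched here, would make the single-condition and multi-condition branches easy to unit test; whether the patch should be restructured that way is a judgment call for the author.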